Below is a brief outline of the steps executed in this workflow.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
%cd /content/drive/MyDrive/yolov8
/content/drive/MyDrive/yolov8
It is easy to upload a document to Google Drive, but for a document that is not on Google Drive, and that we don't want to upload manually each time, Google Colab can fetch it directly with the following steps.
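As an example, a file hosted at a public URL can be pulled straight into the Colab filesystem (or a mounted Drive folder). This is my own sketch; the URL and destination below are placeholders, not paths from this project:

```python
import urllib.request

def fetch_document(url, dest_path):
    """Download a remote document to a local path (e.g. under /content
    or a mounted Drive folder) so no manual upload is needed."""
    urllib.request.urlretrieve(url, dest_path)
    return dest_path

# e.g. fetch_document("https://example.com/report.pdf", "/content/report.pdf")
```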
import os
os.mkdir("/content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/Annotated")
This notebook is based on the ultralytics package and performs training on our own custom objects. Here I am using a document set and trying to tag tables in scanned/screenshot images.
To train YOLOv8, follow these steps:
Annotating the dataset: as an important first step, we need to annotate the dataset. YOLOv8 needs an annotation text file with the same name as each image, e.g. images_1.jpg and images_1.txt. Here images_1.txt contains the coordinates of the bounding boxes that define the objects to be tagged. There are tools we can use, such as LabelImg, which create a text file with the same name as the image, but we still need to manually draw one or more bounding boxes to tag the different parts each image contains.
Split the dataset into train, test, and validation folders.
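Each line of a YOLO label file stores `class x_center y_center width height`, with all coordinates normalized to [0, 1]. A small helper (my own sketch, not part of ultralytics) converts one such line back to pixel corners:

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert a YOLO label line 'cls cx cy w h' (normalized)
    into (cls, x1, y1, x2, y2) pixel coordinates."""
    parts = line.split()
    cls = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1, y1 = int((cx - w / 2) * img_w), int((cy - h / 2) * img_h)
    x2, y2 = int((cx + w / 2) * img_w), int((cy + h / 2) * img_h)
    return cls, x1, y1, x2, y2

# yolo_to_pixels("1 0.5 0.5 0.5 0.5", 640, 640) -> (1, 160, 160, 480, 480)
```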
The developers of YOLOv8 decided to break away from the standard YOLO project design: separate train.py, detect.py, val.py, and export.py scripts. In the short term this will probably cause some confusion, but in the long term it is a fantastic decision!
This pattern has been around since YOLOv3, and every YOLO iteration has replicated it. It was relatively simple to understand but notoriously challenging to deploy, especially in real-time processing and tracking scenarios.
The new approach is much more flexible because it allows YOLOv8 to be used independently through the terminal, as well as being part of a complex computer vision application.
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CUDA:0 (Tesla T4, 15102MiB) engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=/content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/data.yaml, epochs=20, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, stream_buffer=False, line_width=None, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train4 Downloading https://ultralytics.com/assets/Arial.ttf to '/root/.config/Ultralytics/Arial.ttf'... 
100% 755k/755k [00:00<00:00, 10.9MB/s] Overriding model.yaml nc=80 with nc=2 from n params module arguments 0 -1 1 1392 ultralytics.nn.modules.conv.Conv [3, 48, 3, 2] 1 -1 1 41664 ultralytics.nn.modules.conv.Conv [48, 96, 3, 2] 2 -1 2 111360 ultralytics.nn.modules.block.C2f [96, 96, 2, True] 3 -1 1 166272 ultralytics.nn.modules.conv.Conv [96, 192, 3, 2] 4 -1 4 813312 ultralytics.nn.modules.block.C2f [192, 192, 4, True] 5 -1 1 664320 ultralytics.nn.modules.conv.Conv [192, 384, 3, 2] 6 -1 4 3248640 ultralytics.nn.modules.block.C2f [384, 384, 4, True] 7 -1 1 1991808 ultralytics.nn.modules.conv.Conv [384, 576, 3, 2] 8 -1 2 3985920 ultralytics.nn.modules.block.C2f [576, 576, 2, True] 9 -1 1 831168 ultralytics.nn.modules.block.SPPF [576, 576, 5] 10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest'] 11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1] 12 -1 2 1993728 ultralytics.nn.modules.block.C2f [960, 384, 2] 13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest'] 14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1] 15 -1 2 517632 ultralytics.nn.modules.block.C2f [576, 192, 2] 16 -1 1 332160 ultralytics.nn.modules.conv.Conv [192, 192, 3, 2] 17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1] 18 -1 2 1846272 ultralytics.nn.modules.block.C2f [576, 384, 2] 19 -1 1 1327872 ultralytics.nn.modules.conv.Conv [384, 384, 3, 2] 20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1] 21 -1 2 4207104 ultralytics.nn.modules.block.C2f [960, 576, 2] 22 [15, 18, 21] 1 3776854 ultralytics.nn.modules.head.Detect [2, [192, 384, 576]] Model summary: 295 layers, 25857478 parameters, 25857462 gradients, 79.1 GFLOPs Transferred 469/475 items from pretrained weights TensorBoard: Start with 'tensorboard --logdir runs/detect/train4', view at http://localhost:6006/ Freezing layer 'model.22.dfl.conv.weight' AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n... 
Downloading https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt to 'yolov8n.pt'... 100% 6.23M/6.23M [00:00<00:00, 107MB/s] AMP: checks passed ✅ train: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/train/labels... 238 images, 0 backgrounds, 0 corrupt: 100% 238/238 [00:00<00:00, 244.04it/s] train: New cache created: /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/train/labels.cache albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8)) val: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels... 70 images, 0 backgrounds, 0 corrupt: 100% 70/70 [00:00<00:00, 213.62it/s] val: New cache created: /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels.cache Plotting labels to runs/detect/train4/labels.jpg... optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... optimizer: AdamW(lr=0.001667, momentum=0.9) with parameter groups 77 weight(decay=0.0), 84 weight(decay=0.0005), 83 bias(decay=0.0) Image sizes 640 train, 640 val Using 2 dataloader workers Logging results to runs/detect/train4 Starting training for 20 epochs... 
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 1/20 6.88G 1.357 3.079 1.502 43 640: 100% 15/15 [00:10<00:00, 1.39it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:03<00:00, 1.32s/it] all 70 109 0.318 0.561 0.373 0.295 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 2/20 7.09G 0.7192 1.578 1.094 42 640: 100% 15/15 [00:07<00:00, 1.90it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:04<00:00, 1.37s/it] all 70 109 0.536 0.65 0.566 0.387 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 3/20 7.12G 0.7845 1.416 1.071 53 640: 100% 15/15 [00:09<00:00, 1.64it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.28it/s] all 70 109 0.234 0.643 0.186 0.107 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 4/20 7.09G 0.8411 1.371 1.113 47 640: 100% 15/15 [00:07<00:00, 2.00it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.21it/s] all 70 109 0.173 0.736 0.162 0.0855 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 5/20 7.11G 0.7543 1.192 1.067 55 640: 100% 15/15 [00:07<00:00, 2.07it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.21it/s] all 70 109 0.00882 0.293 0.00804 0.00437 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 6/20 7.09G 0.7599 1.094 1.072 49 640: 100% 15/15 [00:07<00:00, 2.14it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.66it/s] all 70 109 0.0318 0.373 0.0278 0.00899 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 7/20 7.13G 0.7945 1.102 1.071 61 640: 100% 15/15 [00:07<00:00, 1.97it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 1.95it/s] all 70 109 0.189 0.572 0.256 0.139 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 8/20 7.16G 0.7349 1.061 1.043 50 640: 100% 15/15 [00:08<00:00, 1.74it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 1.96it/s] all 70 109 0.917 0.217 0.468 0.322 Epoch GPU_mem box_loss 
cls_loss dfl_loss Instances Size 9/20 7.08G 0.7576 1.022 1.044 61 640: 100% 15/15 [00:07<00:00, 2.02it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.32it/s] all 70 109 0.534 0.635 0.592 0.443 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 10/20 7.11G 0.7777 1.091 1.051 42 640: 100% 15/15 [00:07<00:00, 2.05it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.42it/s] all 70 109 0.533 0.563 0.617 0.418 Closing dataloader mosaic albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8)) Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 11/20 7.12G 0.6952 0.9462 1.042 20 640: 100% 15/15 [00:10<00:00, 1.48it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.22it/s] all 70 109 0.592 0.733 0.677 0.503 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 12/20 7.12G 0.6666 0.9255 1.031 31 640: 100% 15/15 [00:07<00:00, 2.07it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 1.75it/s] all 70 109 0.651 0.774 0.694 0.56 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 13/20 7.09G 0.6255 0.9046 0.9958 26 640: 100% 15/15 [00:07<00:00, 1.98it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.21it/s] all 70 109 0.678 0.752 0.666 0.544 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 14/20 7.11G 0.6584 0.9611 1.012 20 640: 100% 15/15 [00:07<00:00, 1.98it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.48it/s] all 70 109 0.67 0.858 0.794 0.599 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 15/20 7.11G 0.6331 0.8373 0.9879 26 640: 100% 15/15 [00:07<00:00, 1.93it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.83it/s] all 70 109 0.704 0.63 0.558 0.466 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 16/20 7.12G 0.5783 0.7323 0.9896 22 640: 100% 15/15 
[00:07<00:00, 1.91it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 1.72it/s] all 70 109 0.944 0.836 0.911 0.781 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 17/20 7.08G 0.5305 0.6985 0.9461 29 640: 100% 15/15 [00:07<00:00, 2.03it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.31it/s] all 70 109 0.89 0.862 0.908 0.821 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 18/20 7.11G 0.4854 0.7519 0.9068 17 640: 100% 15/15 [00:08<00:00, 1.70it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.08it/s] all 70 109 0.855 0.81 0.903 0.826 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 19/20 7.13G 0.4556 0.5932 0.9166 16 640: 100% 15/15 [00:08<00:00, 1.77it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 1.71it/s] all 70 109 0.935 0.854 0.936 0.873 Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 20/20 7.11G 0.4525 0.6147 0.8903 19 640: 100% 15/15 [00:08<00:00, 1.86it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:01<00:00, 2.03it/s] all 70 109 0.881 0.875 0.922 0.859 20 epochs completed in 0.083 hours. Optimizer stripped from runs/detect/train4/weights/last.pt, 52.0MB Optimizer stripped from runs/detect/train4/weights/best.pt, 52.0MB Validating runs/detect/train4/weights/best.pt... Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CUDA:0 (Tesla T4, 15102MiB) Model summary (fused): 218 layers, 25840918 parameters, 0 gradients, 78.7 GFLOPs Class Images Instances Box(P R mAP50 mAP50-95): 100% 3/3 [00:02<00:00, 1.01it/s] all 70 109 0.935 0.854 0.936 0.874 bordered 70 23 0.965 0.826 0.933 0.901 borderless 70 86 0.905 0.882 0.938 0.848 Speed: 6.3ms preprocess, 10.1ms inference, 0.0ms loss, 3.6ms postprocess per image Results saved to runs/detect/train4 💡 Learn more at https://docs.ultralytics.com/modes/train
When the training is over, we can validate the new model on images it has not seen before. That is why, when creating the dataset, we divided it into three parts; one of them now serves as the test dataset.
!yolo task=detect \
mode=val \
model={model_dir} \
data={train_data}/data.yaml
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CPU (Intel Xeon 2.20GHz) Model summary (fused): 168 layers, 3006038 parameters, 0 gradients, 8.1 GFLOPs val: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels.cache... 70 images, 0 backgrounds, 0 corrupt: 100% 70/70 [00:00<?, ?it/s] Class Images Instances Box(P R mAP50 mAP50-95): 100% 5/5 [00:29<00:00, 5.81s/it] all 70 109 0.943 0.905 0.966 0.913 bordered 70 23 0.991 0.913 0.981 0.971 borderless 70 86 0.895 0.896 0.95 0.854 Speed: 6.5ms preprocess, 215.6ms inference, 0.0ms loss, 1.3ms postprocess per image Results saved to runs/detect/val5 💡 Learn more at https://docs.ultralytics.com/modes/val
!ls /content/drive/MyDrive/yolov7/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_5.png
detect detect3 output_page_1.png output_page_4.png output_page_7.png detect2 detect4 output_page_2.png output_page_5.png output_page_8.png detect2_new detect5 output_page_3.png output_page_6.png
!yolo task=detect \
mode=predict \
model={model_dir} \
conf=0.25 \
source={output_dir_} \
name={inf_dir} \
save_txt=True
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CPU (Intel Xeon 2.20GHz)
Model summary (fused): 168 layers, 3006038 parameters, 0 gradients, 8.1 GFLOPs
image 1/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_1.png: 640x512 (no detections), 197.6ms
image 2/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_2.png: 640x512 (no detections), 161.1ms
image 3/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_3.png: 640x512 (no detections), 153.7ms
image 4/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_4.png: 640x512 1 borderless, 158.4ms
image 5/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_5.png: 640x512 1 borderless, 156.1ms
image 6/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_6.png: 640x512 2 borderlesss, 157.2ms
image 7/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_7.png: 640x512 1 borderless, 148.4ms
image 8/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_8.png: 640x512 (no detections), 157.6ms
Speed: 4.5ms preprocess, 161.3ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 512)
Results saved to /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect
4 labels saved to /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/labels
💡 Learn more at https://docs.ultralytics.com/modes/predict
import cv2
import os
# Define the directory containing the images and text files
image_directory = output_dir
text_directory = inf_dir + "labels/"
output_directory = inf_dir + "TagTable/"
margin = 15
# Create the output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)
# Optionally, you can add error handling, resizing, or other processing as needed
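The cropping loop itself is not shown above; the key detail is expanding each detected box by `margin` pixels of surrounding context without running off the page. A helper for that clamping (a sketch of mine, assuming pixel-space box corners) might be:

```python
def expand_box(x1, y1, x2, y2, img_w, img_h, margin=15):
    """Grow a box by `margin` pixels per side, clamped to the image
    bounds, so cropped tables keep a little surrounding context."""
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(img_w, x2 + margin), min(img_h, y2 + margin))

# The crop itself is then image[y1:y2, x1:x2] on the expanded coordinates.
```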
import cv2
import pytesseract

# Directory containing the cropped table images
image_directory = output_directory
# Output directory for saving OCR results
outputocr_directory = inf_dir + "OCRTable/"
# Create the output directory if it doesn't exist
os.makedirs(outputocr_directory, exist_ok=True)

# Loop through the image files in the directory
for image_filename in os.listdir(image_directory):
    if image_filename.endswith(".jpg"):  # Adjust the file extension as needed
        image_path = os.path.join(image_directory, image_filename)
        image = cv2.imread(image_path)
        # Convert the image to grayscale for better OCR accuracy
        gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Use pytesseract to perform OCR on the grayscale image
        ocr_result = pytesseract.image_to_string(gray_image)
        # Generate a filename for the OCR result text file
        output_filename = os.path.splitext(image_filename)[0] + "_ocr.txt"
        output_path = os.path.join(outputocr_directory, output_filename)
        # Save the OCR result to the text file
        with open(output_path, "w") as file:
            file.write(ocr_result)
print("OCR results saved in the directory:", outputocr_directory)
OCR results saved in the directory: /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/OCRTable/
import cv2
import os
from IPython.display import Image, display
# Mount Google Drive to access your data (optional, if your data is in Google Drive)
from google.colab import drive
drive.mount('/content/drive')
#os.rmdir("/content/drive/MyDrive/yolov7/document-parts-2/Annotated")
#os.mkdir("/content/drive/MyDrive/yolov7/document-parts-2/Annotated")
# Directory containing images and label files
anno_dir = train_data + "/Annotated"
data_dir = output_dir_
data_dir_labels = inf_dir + "labels"
file_list = os.listdir(data_dir)
print(file_list)
# Counter to limit the number of images displayed
display_count = 0
print("total number of images in this directory is :", len(file_list))
# Loop through each image in the directory
for image_file in file_list:
    if image_file.endswith(".png"):  # Adjust the file extension as needed
        # Skip pages for which the model saved no label file (no detections)
        label_path = os.path.join(data_dir_labels, os.path.splitext(image_file)[0] + ".txt")
        if not os.path.exists(label_path):
            continue
        # Load the image
        image_path = os.path.join(data_dir, image_file)
        image = cv2.imread(image_path)
        h, w = image.shape[:2]
        # (Reconstructed: the original cell was truncated here.) Print each
        # YOLO label line and draw its box on the page image.
        with open(label_path) as f:
            for line in f:
                print(image_file, line.strip())
                cls, cx, cy, bw, bh = (float(v) for v in line.split())
                x1, y1 = int((cx - bw / 2) * w), int((cy - bh / 2) * h)
                x2, y2 = int((cx + bw / 2) * w), int((cy + bh / 2) * h)
                cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        output_image_file = os.path.join(anno_dir, "annotated_" + image_file)
        cv2.imwrite(output_image_file, image)
        # Display the annotated image using IPython
        print("Name is: ", display_count, image_file)
        display(Image(output_image_file))
        # Increment the counter
        display_count += 1
        # Check if we've displayed 100 images
        #if display_count >= 100:
        #    break  # Stop displaying images
# Print a message when the task is completed
print("Annotation and display completed.")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
['output_page_1.png', 'output_page_2.png', 'output_page_3.png', 'output_page_4.png', 'output_page_5.png', 'output_page_6.png', 'output_page_7.png', 'output_page_8.png', 'detect']
total number of images in this directory is : 9
output_page_4.png 1 0.498138 0.767796 0.64572 0.087686
Name is: 0 output_page_4.png
output_page_5.png 1 0.499508 0.240746 0.64542 0.184216
Name is: 1 output_page_5.png
output_page_6.png 1 0.499833 0.328019 0.64716 0.102913
output_page_6.png 1 0.506315 0.728177 0.662595 0.235299
Name is: 2 output_page_6.png
output_page_7.png 1 0.502882 0.357009 0.650774 0.101709 Name is: 3 output_page_7.png
Annotation and display completed.
## for one image only
import cv2
import pytesseract
from PIL import Image
# Load the image with tables
image_path = "/content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/TagTable/output_page_5_table_1.jpg"
# 'Pdf_To_Images/TagTable/output_page_12_table_1.jpg'
# (Reconstructed: the original cell omitted the OCR step that builds table_matrix.)
# Run OCR and keep one single-cell row per non-empty text line.
ocr_text = pytesseract.image_to_string(Image.open(image_path))
table_matrix = [[line] for line in ocr_text.splitlines() if line.strip()]
# Print the extracted matrix
for row in table_matrix:
    print(row)
['iable 2: summary of Keported and Matched Employment and rirm otructure']
['Percentiles']
['Variable Mean sD PS P25 P50 P75 P95']
['Reported Total Employment']
['Firm employment 307.4 753.7 1 7 57.5 179.5 | 1394.5']
['Matched Employment and Firm Structure (without using the Census Multi-unit data)']
['Employment 303.3 | 801.9 0.5 5 46.5 171 | 1324.5']
['Number of EINs 17 1.8 0.5 1 1 1 5.5']
['Number of 103 | 27.0 | 05 1 2 7 | 485']
['establishments']